1.Introduction


What is Craigslist?

“Craigslist” is a renowned name in every household. It is an American-based advertising platform for sections like employment, housing, for sale, items wanted, services, community service, gigs, resumes, and discussion forums. In 1996, it changed to a web-based service and added more classified categories. In 2000, it began to spread to additional American and Canadian cities, and it presently includes 70 nations.

We will be focusing on the vehicle-selling segment of Craigslist. Being one of the top-rated, convenient, and user-friendly platforms where one can buy and sell their vehicles, it can help the sellers to gain a reasonable price. It consists of a huge repository with relevant data and particulars of all used vehicles at a go in one location.

Wikipedia

Vehicles are preferable in the USA over public transport as the use of vehicles is comparatively feasible and practical. We have chosen Craigslist as our topic since it is the top reputed platform for buying and selling vehicles.


2.Objective


The main objective of Craigslist as a project is to analyze the data and find insights to help craigslist find answers to the below-mentioned questions:

How the pricing problem will be solved?

We will be using supervised learning and dividing the data into training and testing sets. Based on training data, we will be predicting the “price” of every vehicle in the testing data set.

To predict our dependent(target) variable “price”, we will be using independent variables like manufacture year, model of car, car’s condition, craigslist region, etc. It will consist of all the relevant information to help craigslist buy the cars at a reasonable yet profitable.


3. Data Collection


Kaggle link for reference

The data is in CSV file format and its size is 1.45 GB. It has a total 426880 number of observations and 26 Variables. Please find below the column names and descriptions:

Data Information
Variables Defination
id Unique id for every listed car
url listing url
region craigslist region
region_url URL for the region
year manufacture year
manufacturer manufacturer of vehicle
model model of vehicle
condition condition of vehicle
cylinders number of cylinders
fuel fuel type of vehicle
odometer miles traveled by vehicle
title_status status of vehicle
transmission transmission of vehicle
VIN vehicle identification number
drive type of drive
size size of vehicle
type generic type of vehicle
paint_color color of vehicle
image_url image URL
description listed description of vehicle
county useless column left in by mistake
state state of listing
lat latitude of listing
long longitude of listing
posting_date posting date of car
price price of car


4.Data Cleaning


Importing required libraries:

library(plotly)
library(dplyr)
library(tidyverse)
library(olsrr)
library(caTools)
library(xgboost)


Loading data.

Craigslist_data = read.csv("Vehicles.csv")


Note: Due to a computational problem, we cannot work on the whole data.

We will take the first 100k records.

Craigslist_data = Craigslist_data[1:100000,]


Number of rows and columns.

nrow(Craigslist_data)
## [1] 100000
ncol(Craigslist_data)
## [1] 26


Cheking summary of data.

summary(Craigslist_data)
##        id                url               region           region_url       
##  Min.   :7.209e+09   Length:100000      Length:100000      Length:100000     
##  1st Qu.:7.309e+09   Class :character   Class :character   Class :character  
##  Median :7.314e+09   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :7.312e+09                                                           
##  3rd Qu.:7.316e+09                                                           
##  Max.   :7.317e+09                                                           
##                                                                              
##      price                year      manufacturer          model          
##  Min.   :0.000e+00   Min.   :1900   Length:100000      Length:100000     
##  1st Qu.:5.999e+03   1st Qu.:2008   Class :character   Class :character  
##  Median :1.400e+04   Median :2013   Mode  :character   Mode  :character  
##  Mean   :1.111e+05   Mean   :2011                                        
##  3rd Qu.:2.599e+04   3rd Qu.:2017                                        
##  Max.   :3.025e+09   Max.   :2022                                        
##                      NA's   :428                                         
##   condition          cylinders             fuel              odometer       
##  Length:100000      Length:100000      Length:100000      Min.   :       0  
##  Class :character   Class :character   Class :character   1st Qu.:   36657  
##  Mode  :character   Mode  :character   Mode  :character   Median :   82543  
##                                                           Mean   :   96450  
##                                                           3rd Qu.:  130000  
##                                                           Max.   :10000000  
##                                                           NA's   :1521      
##  title_status       transmission           VIN               drive          
##  Length:100000      Length:100000      Length:100000      Length:100000     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      size               type           paint_color         image_url        
##  Length:100000      Length:100000      Length:100000      Length:100000     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  description         county           state                lat        
##  Length:100000      Mode:logical   Length:100000      Min.   :-84.12  
##  Class :character   NA's:100000    Class :character   1st Qu.: 33.78  
##  Mode  :character                  Mode  :character   Median : 36.40  
##                                                       Mean   : 36.79  
##                                                       3rd Qu.: 38.89  
##                                                       Max.   : 82.39  
##                                                       NA's   :539     
##       long         posting_date      
##  Min.   :-158.60   Length:100000     
##  1st Qu.:-120.70   Class :character  
##  Median :-116.36   Mode  :character  
##  Mean   :-107.62                     
##  3rd Qu.: -88.27                     
##  Max.   : 173.89                     
##  NA's   :539


First 5 rocords in the data.

head(Craigslist_data,5)
##           id
## 1 7222695916
## 2 7218891961
## 3 7221797935
## 4 7222270760
## 5 7210384030
##                                                                                            url
## 1              https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html
## 2       https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html
## 3              https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html
## 4 https://worcester.craigslist.org/cto/d/west-brookfield-2002-honda-odyssey-ex/7222270760.html
## 5         https://greensboro.craigslist.org/cto/d/trinity-1965-chevrolet-truck/7210384030.html
##                   region                        region_url price year
## 1               prescott   https://prescott.craigslist.org  6000   NA
## 2           fayetteville      https://fayar.craigslist.org 11900   NA
## 3           florida keys       https://keys.craigslist.org 21000   NA
## 4 worcester / central MA  https://worcester.craigslist.org  1500   NA
## 5             greensboro https://greensboro.craigslist.org  4900   NA
##   manufacturer model condition cylinders fuel odometer title_status
## 1                                                   NA             
## 2                                                   NA             
## 3                                                   NA             
## 4                                                   NA             
## 5                                                   NA             
##   transmission VIN drive size type paint_color image_url description county
## 1                                                                        NA
## 2                                                                        NA
## 3                                                                        NA
## 4                                                                        NA
## 5                                                                        NA
##   state lat long posting_date
## 1    az  NA   NA             
## 2    ar  NA   NA             
## 3    fl  NA   NA             
## 4    ma  NA   NA             
## 5    nc  NA   NA


Column names.

colnames(Craigslist_data)
##  [1] "id"           "url"          "region"       "region_url"   "price"       
##  [6] "year"         "manufacturer" "model"        "condition"    "cylinders"   
## [11] "fuel"         "odometer"     "title_status" "transmission" "VIN"         
## [16] "drive"        "size"         "type"         "paint_color"  "image_url"   
## [21] "description"  "county"       "state"        "lat"          "long"        
## [26] "posting_date"


Data structure.

str(Craigslist_data)
## 'data.frame':    100000 obs. of  26 variables:
##  $ id          : num  7.22e+09 7.22e+09 7.22e+09 7.22e+09 7.21e+09 ...
##  $ url         : chr  "https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html" "https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html" "https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html" "https://worcester.craigslist.org/cto/d/west-brookfield-2002-honda-odyssey-ex/7222270760.html" ...
##  $ region      : chr  "prescott" "fayetteville" "florida keys" "worcester / central MA" ...
##  $ region_url  : chr  "https://prescott.craigslist.org" "https://fayar.craigslist.org" "https://keys.craigslist.org" "https://worcester.craigslist.org" ...
##  $ price       : num  6000 11900 21000 1500 4900 ...
##  $ year        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ manufacturer: chr  "" "" "" "" ...
##  $ model       : chr  "" "" "" "" ...
##  $ condition   : chr  "" "" "" "" ...
##  $ cylinders   : chr  "" "" "" "" ...
##  $ fuel        : chr  "" "" "" "" ...
##  $ odometer    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ title_status: chr  "" "" "" "" ...
##  $ transmission: chr  "" "" "" "" ...
##  $ VIN         : chr  "" "" "" "" ...
##  $ drive       : chr  "" "" "" "" ...
##  $ size        : chr  "" "" "" "" ...
##  $ type        : chr  "" "" "" "" ...
##  $ paint_color : chr  "" "" "" "" ...
##  $ image_url   : chr  "" "" "" "" ...
##  $ description : chr  "" "" "" "" ...
##  $ county      : logi  NA NA NA NA NA NA ...
##  $ state       : chr  "az" "ar" "fl" "ma" ...
##  $ lat         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ long        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ posting_date: chr  "" "" "" "" ...


4.Data Cleaning


Removing unneeded columns.

Craigslist_data = subset(Craigslist_data,select = -c(1,2,4,15,20,21,26))
colnames(Craigslist_data)
##  [1] "region"       "price"        "year"         "manufacturer" "model"       
##  [6] "condition"    "cylinders"    "fuel"         "odometer"     "title_status"
## [11] "transmission" "drive"        "size"         "type"         "paint_color" 
## [16] "county"       "state"        "lat"          "long"

A total of 7 columns are removed which does not give any useful information. They are id, url, region_url, VIN, image_url, description, posting_date.

We are left with 19 columns.


Checking NA values in data.

Replacing blanks with “NA” so we can handle NA values.


Count of missing Values.

colSums(is.na(Craigslist_data))
##       region        price         year manufacturer        model    condition 
##            0            0          428         4420         1284        38289 
##    cylinders         fuel     odometer title_status transmission        drive 
##        40159          603         1521         2279          514        30474 
##         size         type  paint_color       county        state          lat 
##        72154        19611        29776       100000            0          539 
##         long 
##          539


Percentage of missing Values.

colMeans(is.na(Craigslist_data))
##       region        price         year manufacturer        model    condition 
##      0.00000      0.00000      0.00428      0.04420      0.01284      0.38289 
##    cylinders         fuel     odometer title_status transmission        drive 
##      0.40159      0.00603      0.01521      0.02279      0.00514      0.30474 
##         size         type  paint_color       county        state          lat 
##      0.72154      0.19611      0.29776      1.00000      0.00000      0.00539 
##         long 
##      0.00539

County has 100% NA values. Removing the column.


Craigslist_data = subset(Craigslist_data,select = -c(16))
colnames(Craigslist_data)
##  [1] "region"       "price"        "year"         "manufacturer" "model"       
##  [6] "condition"    "cylinders"    "fuel"         "odometer"     "title_status"
## [11] "transmission" "drive"        "size"         "type"         "paint_color" 
## [16] "state"        "lat"          "long"


Removing rows with NA values to have a clean dataset.

Craigslist_data = na.omit(Craigslist_data)


colSums(is.na(Craigslist_data))
##       region        price         year manufacturer        model    condition 
##            0            0            0            0            0            0 
##    cylinders         fuel     odometer title_status transmission        drive 
##            0            0            0            0            0            0 
##         size         type  paint_color        state          lat         long 
##            0            0            0            0            0            0

Our dataset is clean, without any NA values.

nrow(Craigslist_data)
## [1] 17571

17571 records are left after cleaning.


5.Exploratory Data Analysis


Boxplot of price.

boxplot(Craigslist_data$price)

Range of price.

range(Craigslist_data$price)
## [1]          0 1111111111

Removing outliers:

Q1 <- quantile(Craigslist_data$price, .25)
Q3 <- quantile(Craigslist_data$price, .75)
IQR <- IQR(Craigslist_data$price)

only keep rows in the data frame that have values within 1.5*IQR of Q1 and Q3.

Craigslist_data <- subset(Craigslist_data, Craigslist_data$price> (Q1 - 1.5*IQR) & Craigslist_data$price< (Q3 + 1.5*IQR))
range(Craigslist_data$price)
## [1]     0 36250

Outliers are removed and new price range is 0 to 36250


nrow(Craigslist_data)
## [1] 16529

16529 records left after outlier removal.


Barplot for count of listed vehicles in different states.

data0 =  Craigslist_data %>% count(state, name = "Count")
data0
##    state Count
## 1     ak   516
## 2     al   809
## 3     ar   642
## 4     az  1537
## 5     ca  7945
## 6     co  1736
## 7     ct   929
## 8     dc   469
## 9     de   215
## 10    fl  1731
plot_ly(
  data = data0,
  x = data0$Count,
  y = data0$state,
  name = "SF Zoo",
  type = "bar", marker = list(color = 'rgb(158,202,225)',
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5))
) %>% 
layout(yaxis = list(categoryorder = "total ascending", title = 'State'),title = 'State-wise count of listed vehicles', xaxis = list(title = 'Count'))

It shows that California has the highest count of listed vehicles and Delaware has the lowest count of listed vehicles.


Pie chart to find the percentage of vehicles as per their condition.

data1 =  Craigslist_data %>% count(condition, name = "Count")
data1
##   condition Count
## 1 excellent  8727
## 2      fair   666
## 3      good  4866
## 4  like new  2130
## 5       new    74
## 6   salvage    66
Percentage= (data1$Count/sum(data1$Count))*100
data1 = mutate(data1, Percentage = Percentage)
data1
##   condition Count Percentage
## 1 excellent  8727 52.7981124
## 2      fair   666  4.0292819
## 3      good  4866 29.4391675
## 4  like new  2130 12.8864420
## 5       new    74  0.4476980
## 6   salvage    66  0.3992982
Condition = data1$condition
Percentage = data1$Percentage

pie1 = ggplot(data = data1, aes(x="", y = Percentage, fill = Condition)) +
       geom_col(color = "White") +
       coord_polar("y", start = 0) + 
       geom_text(aes(label = paste0(round(Percentage,2), "%")), 
                          position = position_stack(vjust = 0.6), size = 3) +
       theme(panel.background = element_blank(),
             axis.line = element_blank(),
             axis.text = element_blank(),
             axis.ticks = element_blank(),
             axis.title = element_blank(), 
             plot.title = element_text(hjust = 0.5, size = 18)) +
       ggtitle("Percentage of vehicles as per the Condition") 
       

pie1


We can see 52.8% of vehicles are in Excellent condition and 29.44% vehciles are in Good condition. A very small percentage of vehicles are Salvage.


Manufacturers with the highest price listed vehicles.

data2 = Craigslist_data %>%
  group_by(manufacturer) %>%
  summarise(max_price=max(price))

data2 = data2[order(-data2$max_price), ]
data2
## # A tibble: 41 × 2
##    manufacturer max_price
##    <chr>            <dbl>
##  1 gmc              36250
##  2 ford             36000
##  3 jeep             36000
##  4 mitsubishi       36000
##  5 pontiac          36000
##  6 ram              36000
##  7 toyota           36000
##  8 chevrolet        35999
##  9 dodge            35950
## 10 bmw              35900
## # … with 31 more rows
data2$manufacturer <- factor(data2$manufacturer,                                   
                  levels = data2$manufacturer[order(data2$max_price, decreasing = TRUE)])

p <-  ggplot(data=data2, aes(x= manufacturer, y=max_price, fill= manufacturer)) +
    geom_bar(stat="identity")+ ggtitle("Maximum price vehicle for each manufacturer")

fig <- ggplotly(p)

fig

GMC tops the charts with the highest price vehicle among all.

Ford is in the second spot.


Manufacturers with the lowest price listed vehicles.

data3 = Craigslist_data %>%
  group_by(manufacturer) %>%
  summarise(min_price=min(price))

data3 = data3[order(data3$min_price), ]
data3
## # A tibble: 41 × 2
##    manufacturer min_price
##    <chr>            <dbl>
##  1 acura                0
##  2 audi                 0
##  3 bmw                  0
##  4 buick                0
##  5 cadillac             0
##  6 chevrolet            0
##  7 chrysler             0
##  8 dodge                0
##  9 ferrari              0
## 10 fiat                 0
## # … with 31 more rows
data3$manufacturer <- factor(data3$manufacturer,                                   
                  levels = data3$manufacturer[order(data3$min_price, decreasing = FALSE)])

p <-  ggplot(data=data3, aes(x= manufacturer, y=min_price, fill= manufacturer)) +
    geom_bar(stat="identity")+ ggtitle("Minimum price vehicle for each manufacturer")

fig <- ggplotly(p)

fig

We can see many manufacturers have minimum price vehcile listed at 0$.

We can see a lowest price listing for Saturn manufacturer with a genuine listed price of $300.


6.Splitting data into Train and Test


As we have many categorical columns; we must convert them into factors to implement linear regression as we can pass only numerical variables. Converting categorical variables into factors.

Craigslist_data$region = as.factor(Craigslist_data$region)
Craigslist_data$year = as.factor(Craigslist_data$year)
Craigslist_data$manufacturer = as.factor(Craigslist_data$manufacturer)
Craigslist_data$model = as.factor(Craigslist_data$model)
Craigslist_data$condition = as.factor(Craigslist_data$condition)
Craigslist_data$cylinders = as.factor(Craigslist_data$cylinders)
Craigslist_data$fuel = as.factor(Craigslist_data$fuel)
Craigslist_data$title_status = as.factor(Craigslist_data$title_status)
Craigslist_data$transmission = as.factor(Craigslist_data$transmission)
Craigslist_data$drive = as.factor(Craigslist_data$drive)
Craigslist_data$size = as.factor(Craigslist_data$size)
Craigslist_data$type = as.factor(Craigslist_data$type)
Craigslist_data$paint_color = as.factor(Craigslist_data$paint_color)
Craigslist_data$state = as.factor(Craigslist_data$state)


We can see factors for all categorical columns.

str(Craigslist_data)
## 'data.frame':    16529 obs. of  18 variables:
##  $ region      : Factor w/ 75 levels "anchorage / mat-su",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ price       : num  15000 19900 14000 22500 15000 3000 9500 17500 0 0 ...
##  $ year        : Factor w/ 93 levels "1918","1923",..: 85 76 84 73 89 76 75 80 90 91 ...
##  $ manufacturer: Factor w/ 41 levels "acura","alfa-romeo",..: 14 14 17 14 11 9 9 39 8 8 ...
##  $ model       : Factor w/ 3667 levels "-benz s430","(cng) 2500 express van",..: 1477 1587 2421 1625 852 3309 3309 1675 1439 1439 ...
##  $ condition   : Factor w/ 6 levels "excellent","fair",..: 1 3 1 3 1 3 1 3 4 4 ...
##  $ cylinders   : Factor w/ 8 levels "10 cylinders",..: 6 7 6 7 7 6 6 6 6 6 ...
##  $ fuel        : Factor w/ 5 levels "diesel","electric",..: 3 1 3 1 3 3 3 3 3 3 ...
##  $ odometer    : int  128000 88000 95000 144700 90000 176144 30376 201300 68472 69125 ...
##  $ title_status: Factor w/ 6 levels "clean","lien",..: 1 1 1 1 5 1 1 1 1 1 ...
##  $ transmission: Factor w/ 3 levels "automatic","manual",..: 1 1 1 2 1 1 1 2 1 1 ...
##  $ drive       : Factor w/ 3 levels "4wd","fwd","rwd": 3 1 2 3 3 2 2 1 3 3 ...
##  $ size        : Factor w/ 4 levels "compact","full-size",..: 2 2 2 2 3 3 3 2 2 2 ...
##  $ type        : Factor w/ 13 levels "bus","convertible",..: 11 8 5 11 9 5 5 6 12 12 ...
##  $ paint_color : Factor w/ 12 levels "black","blue",..: 1 2 10 11 6 10 2 1 11 11 ...
##  $ state       : Factor w/ 10 levels "ak","al","ar",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ lat         : num  32.6 32.5 32.6 32.6 32.8 ...
##  $ long        : num  -85.5 -85.5 -85.5 -85.4 -85.8 ...


Using 70% of the dataset as training set and 30% as testing set.

set.seed(1)
sample <- sample.split(Craigslist_data$price, SplitRatio = 0.7)
train  <- subset(Craigslist_data, sample == TRUE)
test   <- subset(Craigslist_data, sample == FALSE)
dim(train)
## [1] 11703    18
dim(test)
## [1] 4826   18


7.Modeling

Note: Due to converting the categorical column into factors, we have a lot many columns, that’s why it into possible to show the whole summary of a model in an HTML file. We will be using str(summary) to show the results.


Model1: Model with all independent variables.

model1 <- lm(price~., data = train)
str(summary(model1))
## List of 11
##  $ call         : language lm(formula = price ~ ., data = train)
##  $ terms        :Classes 'terms', 'formula'  language price ~ region + year + manufacturer + model + condition + cylinders +      fuel + odometer + title_status + tran| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel,      odometer, title_status, transmiss| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:18, 1:17] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:18] "price" "region" "year" "manufacturer" ...
##   .. .. .. ..$ : chr [1:17] "region" "year" "manufacturer" "model" ...
##   .. ..- attr(*, "term.labels")= chr [1:17] "region" "year" "manufacturer" "model" ...
##   .. ..- attr(*, "order")= int [1:17] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel,      odometer, title_status, transmiss| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:18] "numeric" "factor" "factor" "factor" ...
##   .. .. ..- attr(*, "names")= chr [1:18] "price" "region" "year" "manufacturer" ...
##  $ residuals    : Named num [1:11703] 2.68e+03 8.32e+03 1.38e+04 -4.11e-09 8.31e+03 ...
##   ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
##  $ coefficients : num [1:3310, 1:4] 23054 -13705 -4846 -6019 -5114 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:3354] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "names")= chr [1:3354] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##  $ sigma        : num 5128
##  $ df           : int [1:3] 3310 8393 3354
##  $ r.squared    : num 0.745
##  $ adj.r.squared: num 0.644
##  $ fstatistic   : Named num [1:3] 7.4 3309 8393
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:3310, 1:3310] 6.028 -0.369 -0.245 -0.356 -0.262 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##   .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##  - attr(*, "class")= chr "summary.lm"

We are getting an R square value of 0.74, It means that 74% of the variation in the price is explained by the independent variables. The adjusted R square value is 0.64, which is more important in multiple regression because it looks at whether additional input variables are contributing to the model.


Let’s try forward selection for our variable selection.

#Forward_sec = ols_step_forward_p(model1,penter=.05)
#Forward_sec

After running above step for 2 hours, we haven’t received any results because of variables like model,year,region. They have 3835-factor values, which means it added 3835 different columns.

Let us try modeling and forward selection without these columns.


Model2: Model without variables - model,year,region.

train2 = subset(train,select = -c(1,3,5))
model2 <- lm(price~., data = train2)
str(summary(model2))
## List of 11
##  $ call         : language lm(formula = price ~ ., data = train2)
##  $ terms        :Classes 'terms', 'formula'  language price ~ manufacturer + condition + cylinders + fuel + odometer + title_status +      transmission + drive + size | __truncated__
##   .. ..- attr(*, "variables")= language list(price, manufacturer, condition, cylinders, fuel, odometer, title_status,      transmission, drive, size, typ| __truncated__
##   .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:15] "price" "manufacturer" "condition" "cylinders" ...
##   .. .. .. ..$ : chr [1:14] "manufacturer" "condition" "cylinders" "fuel" ...
##   .. ..- attr(*, "term.labels")= chr [1:14] "manufacturer" "condition" "cylinders" "fuel" ...
##   .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(price, manufacturer, condition, cylinders, fuel, odometer, title_status,      transmission, drive, size, typ| __truncated__
##   .. ..- attr(*, "dataClasses")= Named chr [1:15] "numeric" "factor" "factor" "factor" ...
##   .. .. ..- attr(*, "names")= chr [1:15] "price" "manufacturer" "condition" "cylinders" ...
##  $ residuals    : Named num [1:11703] -4173 2152 3298 3329 -3222 ...
##   ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
##  $ coefficients : num [1:104, 1:4] 44124 11643 20036 1078 674 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:104] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "names")= chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
##  $ sigma        : num 7151
##  $ df           : int [1:3] 104 11599 104
##  $ r.squared    : num 0.314
##  $ adj.r.squared: num 0.308
##  $ fstatistic   : Named num [1:3] 51.6 103 11599
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:104, 1:104] 0.35976 -0.01072 -0.01232 -0.00852 -0.00834 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
##   .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
##  - attr(*, "class")= chr "summary.lm"

The adjusted R square value decreased to 0.30. It tells us that these variables are important and we cannot remove them.


Let’s try forward selection without these varaibles.

Forward_sec = ols_step_forward_p(model2,penter=.05)
Forward_sec
## 
##                                   Selection Summary                                   
## -------------------------------------------------------------------------------------
##         Variable                      Adj.                                               
## Step      Entered       R-Square    R-Square      C(p)           AIC          RMSE       
## -------------------------------------------------------------------------------------
##    1    type              0.1141      0.1132    3288.6333    243862.7258    8096.8747    
##    2    condition         0.1960      0.1949    1904.2575    242736.7956    7714.9536    
##    3    state             0.2273      0.2256    1377.6022    242290.8460    7566.4493    
##    4    drive             0.2456      0.2438    1068.8165    242013.2604    7476.6070    
##    5    odometer          0.2604      0.2585     821.7322    241784.5834    7403.6006    
##    6    transmission      0.2736      0.2717     599.3190    241576.7876    7337.5378    
##    7    fuel              0.2847      0.2826     413.3343    241404.3735    7282.4453    
##    8    manufacturer      0.2967      0.2922     212.9842    241287.0131    7233.7085    
##    9    size              0.3019      0.2972     127.0926    241206.2409    7207.8709    
##   10    paint_color       0.3084      0.3031      19.7023    241119.3374    7177.8090    
##   11    cylinders         0.3117      0.3061     -35.4119    241076.0721    7162.4282    
##   12    long              0.3136      0.3079     -65.1230    241046.1558    7152.9763    
##   13    title_status      0.3144      0.3083     -75.8259    241043.3462    7150.6033    
## -------------------------------------------------------------------------------------

The results show us that after adding title status to the model, there is no further improvement in the Adjusted square value. We cannot take a forward selection approach in this case.

We must try and test. Removing variables lat long, state as they will be correlated with region in our view.



Model3: Model without variables - lat long, state.

train3 = subset(train,select = -c(16,17,18))
model3 <- lm(price~., data = train3)
str(summary(model3))
## List of 11
##  $ call         : language lm(formula = price ~ ., data = train3)
##  $ terms        :Classes 'terms', 'formula'  language price ~ region + year + manufacturer + model + condition + cylinders +      fuel + odometer + title_status + tran| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel,      odometer, title_status, transmiss| __truncated__
##   .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:15] "price" "region" "year" "manufacturer" ...
##   .. .. .. ..$ : chr [1:14] "region" "year" "manufacturer" "model" ...
##   .. ..- attr(*, "term.labels")= chr [1:14] "region" "year" "manufacturer" "model" ...
##   .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel,      odometer, title_status, transmiss| __truncated__
##   .. ..- attr(*, "dataClasses")= Named chr [1:15] "numeric" "factor" "factor" "factor" ...
##   .. .. ..- attr(*, "names")= chr [1:15] "price" "region" "year" "manufacturer" ...
##  $ residuals    : Named num [1:11703] 2.67e+03 8.25e+03 1.40e+04 -3.94e-09 8.34e+03 ...
##   ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
##  $ coefficients : num [1:3308, 1:4] 15640 -10556 -3417 -2806 -2936 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:3343] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "names")= chr [1:3343] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##  $ sigma        : num 5129
##  $ df           : int [1:3] 3308 8395 3343
##  $ r.squared    : num 0.745
##  $ adj.r.squared: num 0.644
##  $ fstatistic   : Named num [1:3] 7.41 3307 8395
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:3308, 1:3308] 5.22642 -0.00284 -0.0076 -0.00574 -0.0053 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##   .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
##  - attr(*, "class")= chr "summary.lm"

We see very little improvement in the R square value. Adjusted R square remains the same which means that these variables are insignificant and do not contribute.

We will go forward and use Model 3 for prediction.

test3 = subset(test,select = -c(16,17,18))



Residual histogram

modelResiduals <- as.data.frame(residuals(model3)) 
ggplot(modelResiduals, aes(residuals(model3))) +
  geom_histogram(fill='deepskyblue', color='black')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see here that the residual plot shows the normal distribution. It proves the assumption of linear regression that residuals of the model are normally distributed.


In test data, replaced new levels by NA to solve problem of missing levels in test data set.

test3$year[which(!(test3$year %in% unique(train3$year)))] <- NA
test3$model[which(!(test3$model %in% unique(train3$model)))] <- NA
test3 = na.omit(test3)


Prediction and RMSE of Linear regression.

preds = predict(model3, test3)
## Warning in predict.lm(model3, test3): prediction from a rank-deficient fit may
## be misleading


Actual vs Predicted

modelEval = cbind(test3$price, preds)
colnames(modelEval) = c('Actual', 'Predicted')
modelEval = as.data.frame(modelEval)
modelEval$Predicted = round(modelEval$Predicted)
head(modelEval)
##     Actual Predicted
## 56   19900     10555
## 111   9500     -3651
## 139   4000     -1943
## 220   2500      -821
## 223  12500     14195
## 245      0     32661


RMSE of Linear regression.

mse1 = mean((modelEval$Actual - modelEval$Predicted)^2) 
rmse1 = sqrt(mse1)
rmse1
## [1] 5589.309

Root mean sqaure error for linear regression model is 5589.30



Model4: XG BOOST

For comparison, we want to use a more advanced algorithm. Let us try to predict price using XGBoost.

Define predictor and response variables in training and testing set.

train_x = data.matrix(train3[, -2])
train_y = train3[,2]

test_x = data.matrix(test3[, -2])
test_y = test3[, 2]

Fit XGBoost model to training set

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
model4 = xgb.train(data = xgb_train, max.depth = 3, nrounds = 700, verbose = 0)

Prediction and RMSE of XGB.

preds2 = predict(model4, xgb_test)


Actual vs Predicted

modelEval2 <- cbind(test_y, preds2)
colnames(modelEval2) <- c('Actual', 'Predicted')
modelEval2 <- as.data.frame(modelEval2)
modelEval2$Predicted <- round(modelEval2$Predicted)
head(modelEval2)
##   Actual Predicted
## 1  19900     18554
## 2   9500      7217
## 3   4000      3886
## 4   2500      2567
## 5  12500     11954
## 6      0     17950


RMSE of XGB Model.

mse2 = mean((test_y - preds2)^2) 
rmse2 = sqrt(mse2)
rmse2
## [1] 4424.406

Root mean sqaure error for XGB model is 4424.40.

We are getting better results in comparison to linear regression model.


Variable importance plot

importance_matrix = xgb.importance(colnames(xgb_train), model = model4)
xgb.plot.importance(importance_matrix[1:14,])


The above graph shows that year is the most important variable followed by odometer.


8.Conclusions


1. California has the highest number of listed vehicles.

2. GMC has the highest price listed vehicle and Saturn has the lowest price listed vehicle.

3. 52.8% of listed vehicles are in excellent condition and 29.44% of vehicles are in good condition. Only 0.39% of vehicles are Salvage.

4. Manufacture Year and Odometer are the most important variable, which totally makes sense.

5. We predicted the price using two different algorithms. XGBoost beats linear regression in terms of root mean square error.


9. Scope


If we have the better processing power, we can use more data, which will provide more information and produce better results.

With additional time, we can tune the parameters, which can improve the results.